
feat: UAT runner ergonomics + demote fastmcp tool-failure tracebacks #1051

Merged — sergeykad merged 5 commits into master from fix/uat-runner-ergonomics on Apr 24, 2026

Conversation

@sergeykad Collaborator

What does this PR do?

Cleans up the UAT runner and story harness along several axes surfaced while debugging a BAT session. No behavior changes to the production MCP server beyond one log-level downgrade.

Server-side

UAT runner ergonomics (tests/uat/run_uat.py, tests/uat/stories/run_story.py, tests/uat/README.md)

  • SuggestingArgumentParser: typo-tolerant argparse with difflib suggestions (e.g., --agants → did you mean --agents?); sketched together with the stdin guard after this list.
  • Stdin TTY guard so run_uat.py with no pipe/file fails fast with a helpful message instead of hanging.
  • Quick-start section in the README pointing to run_story.py --all.
  • Commands corrected to uv run python; /v1 suffix requirement called out for LM Studio/Ollama.
  • Default LOG_LEVEL=WARNING for the spawned MCP subprocess (override with --mcp-env LOG_LEVEL=INFO).
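A condensed sketch of the two guards (illustrative only — the exact wording and helper names in run_uat.py differ):

import argparse
import difflib
import sys


class SuggestingArgumentParser(argparse.ArgumentParser):
    """Argparse that suggests the closest known flag on a typo."""

    def error(self, message: str) -> None:
        if "unrecognized arguments:" in message:
            bad = message.split("unrecognized arguments:")[-1].split()
            known = [opt for action in self._actions for opt in action.option_strings]
            for arg in bad:
                match = difflib.get_close_matches(arg, known, n=1)
                if match:
                    message += f" (did you mean {match[0]}?)"
        super().error(message)  # prints usage + message, then exits with status 2


def read_stdin_or_fail() -> str:
    """Fail fast with a hint instead of hanging on an interactive terminal."""
    if sys.stdin.isatty():
        raise SystemExit("run_uat.py expects input on stdin; pipe or redirect a file into it")
    return sys.stdin.read()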

Shared in-process MCP client (tests/uat/_inprocess.py)

  • One HomeAssistantSmartMCPServer + FastMCP client per agent run, shared across all stories' setup/verify/teardown. Previously rebuilt per phase per story; construction takes ~1.5s, so sharing saves ~150s on a 50-story run (see the sketch after this list).
  • verify_ha_checks(mcp_client) now takes the client as a required argument. _mcp_context in verify_story.py was a duplicate and has been removed.
  • The pytest mcp_client fixture in tests/uat/stories/conftest.py also uses the shared helper (bonus: it now picks up the websocket_manager.disconnect() that the old fixture skipped).
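Roughly, the shared helper has this shape (a sketch: the import paths and the .mcp attribute below are placeholders, and the env-var swap / ha_mcp.config._settings reset are omitted):

from contextlib import asynccontextmanager

from fastmcp import Client


@asynccontextmanager
async def inprocess_mcp_client():
    """One server + one client per agent run, reused by setup/verify/teardown.

    Not safe for concurrent use: mutates process-global state.
    """
    from ha_mcp.server import HomeAssistantSmartMCPServer   # placeholder path
    from ha_mcp.websocket import websocket_manager          # placeholder path

    server = HomeAssistantSmartMCPServer()        # ~1.5s — pay it once, not per phase
    try:
        async with Client(server.mcp) as client:  # in-memory FastMCP transport; .mcp is a placeholder
            yield client
    finally:
        # Symmetric teardown: drop the cached websocket so the next run starts clean.
        await websocket_manager.disconnect()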

Logging migration (tests/uat/*)

  • Replaced four duplicated def log() helpers (thin print wrappers) with stdlib logging module usage throughout. Each UAT file uses an explicit logging.getLogger("uat.<module>") so the namespace level filter works regardless of script vs. module invocation.
  • New tests/uat/_logging.py::configure_cli_logging() — single place that sets root=WARNING / uat.*=INFO, silencing httpx/openai/mcp INFO chatter without suppressing our own trace (sketched after this list).
  • Level mapping: FATAL: ... → logger.critical, ERROR: ... → logger.error / logger.exception, check failures and summary-level story failures → logger.warning.
  • Stripped the Pydantic errors.pydantic.dev URL footer from client-side validation error echoes (tests/uat/openai_agent.py::_strip_pydantic_url).
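The shape of the setup, roughly (handler and format details simplified):

import logging


def configure_cli_logging() -> None:
    """Root at WARNING (mutes httpx/openai/mcp INFO), uat.* at INFO."""
    logging.basicConfig(level=logging.WARNING, format="%(levelname)s %(name)s: %(message)s")
    logging.getLogger("uat").setLevel(logging.INFO)


# Each UAT module then grabs an explicit namespaced logger, so the level
# applies whether the file runs as a script or is imported as a module:
logger = logging.getLogger("uat.run_story")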

Summary correctness

  • The Summary block in run_story.py now uses _compute_passed (the same logic as the JSONL records written via append_result). Previously it only considered the test-prompt exit code, so stories that failed ha_checks but exited 0 were marked PASS in the summary but FAIL in the JSONL. These now agree.
  • append_result signature changed: exit_code= → passed= (single source of truth, computed once per story); see the sketch after this list.
  • Summary now prints total wall time.
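In outline (the JSONL field names here are illustrative, not the exact record format):

import json
import time
from pathlib import Path


def _compute_passed(exit_code: int, ha_checks_ok: bool) -> bool:
    # A story passes only if the test prompt exited 0 AND its ha_checks held.
    return exit_code == 0 and ha_checks_ok


def append_result(results_path: Path, story: str, passed: bool = False) -> None:
    # Failure-closed default; the Summary block reuses the same value.
    record = {"story": story, "passed": passed, "ts": time.time()}
    with results_path.open("a") as fh:
        fh.write(json.dumps(record) + "\n")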

Type of change

  • 🐛 Bug fix (summary PASS/FAIL mismatch; fastmcp validation traceback noise)
  • ✨ New feature
  • 📚 Documentation
  • 🔧 Maintenance/refactor (logging migration, shared MCP client)
  • 🧪 Tests only
  • 💥 Breaking change

Testing

  • I have tested these changes with an LLM agent (one BAT run of s01_automation_sunset_lights against LM Studio verified: clean WARNING logs, no Pydantic URL footer, one server construction, correct FAIL summary from _compute_passed)
  • All automated tests pass (uv run pytest tests/uat/stories/test_verify_story.py — 12/12)
  • Code follows style guidelines (uv run ruff check — clean)

Checklist

  • I have updated documentation if needed (UAT README)

Sergey added 2 commits April 24, 2026 19:05
- Demote fastmcp tool-validation/FastMCPError tracebacks to one-line
  WARNING via ToolValidationLogFilter (pending upstream fastmcp PR #4029).
- Runner ergonomics: argparse typo suggestions, TTY stdin guard, clearer
  startup progress, UAT README quick-start, LOG_LEVEL=WARNING default for
  spawned MCP subprocess.
- Share one in-process FastMCP server per agent run (new
  tests/uat/_inprocess.py) instead of rebuilding per setup/verify phase.
  Saves ~1.5s per story-phase; ~150s on a 50-story run.
- Migrate all UAT scripts from print-based log() to stdlib logging; keep
  uat.* namespace at INFO while silencing httpx/openai/mcp INFO chatter.
- Fix Summary pass/fail: now uses _compute_passed (same logic as JSONL
  records), so stories that fail ha_checks but exit 0 are correctly
  marked FAIL.
- Drop duplicated client-side validation error echo (server log already
  shows the failure details).
@gemini-code-assist Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request improves the ergonomics, performance, and logging reliability of the UAT runner and story harness. By centralizing logging, optimizing MCP client lifecycle management, and refining CLI usability, these changes make the test suite faster and easier to debug without altering production behavior.

Highlights

  • Logging Improvements: Migrated UAT scripts to use the standard logging module instead of print wrappers, allowing for better control over log levels and namespace filtering. Added a ToolValidationLogFilter to demote noisy Pydantic and FastMCP validation tracebacks to single-line warnings.
  • UAT Runner Ergonomics: Introduced a typo-tolerant argument parser, added a TTY guard to prevent hanging, and updated the README with quick-start instructions and command corrections.
  • Performance Optimization: Implemented a shared in-process MCP client for UAT stories, significantly reducing overhead by reusing the server instance across setup, verification, and teardown phases.
  • Summary Correctness: Unified the PASS/FAIL logic in the story runner to ensure the summary output matches the generated JSONL records, and added total wall-time reporting.

@gemini-code-assist Bot left a comment

Code Review

This pull request refactors the User Acceptance Testing (UAT) suite to improve logging, error reporting, and resource efficiency. It introduces a ToolValidationLogFilter to demote noisy tracebacks, standardizes on the Python logging module across CLI tools, and implements a shared in-process FastMCP client context to reduce server startup overhead. Feedback focuses on adhering to the project's requirement for type hints in all function signatures and improving the robustness of the story runner by handling setup failures more gracefully to avoid terminating the entire test suite.

Comment thread tests/uat/_inprocess.py Outdated
Comment thread tests/uat/stories/run_story.py Outdated
Comment thread tests/uat/stories/run_story.py
Comment thread tests/uat/stories/scripts/verify_story.py Outdated
Sergey added 2 commits April 24, 2026 21:37
run_start was shadowed inside the per-story loop by a variable tracking
the test prompt start (used for session file discovery). Rename the inner
variable to prompt_start so the Summary's elapsed calculation reflects
the whole run.
@sergeykad sergeykad marked this pull request as ready for review April 24, 2026 18:43
@sergeykad sergeykad requested review from a team and julienld April 24, 2026 18:43
@kingpanther13 (Member) left a comment

Thanks for the UAT cleanup — the shared MCP client savings, the _compute_passed consolidation, and the unified uat.* logger namespace are all solid wins. A few items worth addressing before merge:

Important

  1. ToolValidationLogFilter is wider than the stated intent (src/ha_mcp/__main__.py:387) — isinstance(err, FastMCPError) also matches AuthorizationError, ResourceError, PromptError, etc. If the goal is "tool-raised, user-visible error", tightening to isinstance(err, ToolError) would match the docstring exactly and avoid silencing a future auth failure (a sketch follows this list).

  2. Filter discards exc_info / exc_text permanently (__main__.py:394-395) — any downstream handler (Sentry, structured logging, file tail) loses the stack after this filter runs. Consider stashing a one-frame summary into record.msg instead of nulling entirely, so minimal debug context survives.

  3. Commit / PR scope is misleading — feat(uat): covers a production logging change that affects the live server, not just UAT. Release-note automation keys off these prefixes. Splitting the log filter into its own feat(internal): demote fastmcp tool-failure tracebacks commit (or retitling the PR) would keep the changelog honest.

  4. Shared MCP client is a single point of failure across stories (tests/uat/stories/run_story.py around line 786) — the prior per-call _mcp_context gave implicit isolation; reusing one client means a story that corrupts the WebSocket can poison every subsequent story in the same agent run. A short try/reconnect guard, or at minimum a docstring note acknowledging the tradeoff, would help future maintainers diagnose intermittent failures.

  5. inprocess_mcp_client finally doesn't reset websocket_manager (tests/uat/_inprocess.py:49-58) — env vars and _settings are restored, but the module-level websocket stays connected to the (possibly-stopped) test container. Consider await websocket_manager.disconnect() in the finally too, symmetrical with the entry path.

  6. Env-var mutation is process-global (_inprocess.py:37-42) — safe under current sequential usage, but worth an explicit docstring line calling out "not safe for concurrent use" so nobody later adds pytest-xdist or parametrized fixtures and hits races.

  7. _run_mcp_steps teardown swallows bare Exception at INFO level (run_story.py around lines 286-290) — a broken websocket during teardown is logged "failed, ignored" at INFO and silently poisons the next story's setup via the shared client. logger.warning (or logger.exception) would be more honest, and the catch could be narrowed to ToolError / expected transport errors.
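To make items 1 and 2 concrete, a sketch of the tightened filter (field handling simplified; adjust to the real attach point in __main__.py):

import logging

from fastmcp.exceptions import ToolError
from pydantic import ValidationError


class ToolValidationLogFilter(logging.Filter):
    """Demote tool-raised / validation failures to a one-line WARNING."""

    def filter(self, record: logging.LogRecord) -> bool:
        if record.exc_info is None:
            return True                                  # nothing to demote
        err = record.exc_info[1]
        if not isinstance(err, (ToolError, ValidationError)):
            return True                                  # auth/resource/prompt errors keep their stacks
        record.msg = f"{record.getMessage()} ({err})"    # keep a one-line summary of the failure
        record.args = ()
        record.levelno = logging.WARNING
        record.levelname = logging.getLevelName(logging.WARNING)
        record.exc_info = None                           # item 2: this is the lossy part
        record.exc_text = None
        return True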

Nice-to-have

  1. Test coverage for ToolValidationLogFilter — the sibling StatelessSessionLogFilter has tests/src/unit/test_stateless_session_log_filter.py with five targeted cases. Cloning that template for the new filter would lock in the behavior (bare Exception passes through, subclass handling, exc_info=None pass-through, right-logger/wrong-message, etc.) and guard against a future tightening regression; a starter case is sketched after this list.

  2. Lazy imports inside filter() (__main__.py:379-380) — both fastmcp.exceptions and pydantic are loaded long before any log record reaches this filter, so hoisting the imports to module scope removes a per-record sys.modules lookup in the hot path. Low-impact but free.

  3. Docstring clarity in _inprocess.py — "the env-swap and WebSocket disconnect point ha_mcp's module-level settings at the target HA instance" is accurate but a bit opaque. Something like "clearing ha_mcp.config._settings forces the next get_global_settings() call to re-read env; the websocket disconnect tears down any cached client on the previous URL" would be easier on future readers.
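A starter case for that test module (import path and logger name assumed; adjust to wherever the filter actually lives):

import logging
import sys

from fastmcp.exceptions import ToolError

from ha_mcp.__main__ import ToolValidationLogFilter   # adjust to the real location


def _record_with(exc: BaseException) -> logging.LogRecord:
    try:
        raise exc
    except BaseException:
        exc_info = sys.exc_info()
    return logging.LogRecord(
        name="fastmcp",            # illustrative; the real filter may key off a specific logger
        level=logging.ERROR, pathname=__file__, lineno=0,
        msg="tool failed", args=(), exc_info=exc_info,
    )


def test_tool_error_demoted_to_one_line_warning():
    record = _record_with(ToolError("bad argument"))
    assert ToolValidationLogFilter().filter(record) is True   # still emitted, just demoted
    assert record.levelno == logging.WARNING
    assert record.exc_info is None                            # stack dropped


def test_bare_exception_keeps_its_traceback():
    record = _record_with(RuntimeError("real bug"))
    assert ToolValidationLogFilter().filter(record) is True
    assert record.levelno == logging.ERROR
    assert record.exc_info is not None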

Strengths

  • Filter correctness verified: bare Exception and NotFoundError retain full tracebacks; pydantic.ValidationError is identity-equal to pydantic_core.ValidationError, so both fastmcp log paths are caught.
  • append_result(passed=False) default is a good failure-closed choice — the prior exit_code=0 default is exactly what caused the Summary/JSONL divergence this PR fixes.
  • verify_ha_checks(mcp_client) breaking change is safe (single caller, grep-verified).
  • SuggestingArgumentParser, stdin TTY guard, /v1 suffix note in the README, and the per-agent prompt_start vs run_start split (commit 945a4c8) — all genuine UX improvements.

Happy to iterate on any of these.

…ed client

- Narrow ToolValidationLogFilter to ToolError (was FastMCPError)
- Add websocket_manager.disconnect() in inprocess_mcp_client finally
- Log teardown failures as WARNING (was INFO) with shared-client note
- Hoist fastmcp/pydantic imports to module scope
- Document shared-client SPoF tradeoff and process-global env mutation
- Add unit tests for ToolValidationLogFilter (6 cases)
@sergeykad sergeykad changed the title from feat(uat): runner ergonomics, shared MCP client, logging cleanup to feat: UAT runner ergonomics + demote fastmcp tool-failure tracebacks on Apr 24, 2026
@sergeykad Collaborator Author

Thanks for the review.

Addressed in 8869f42:

  1. Narrowed FastMCPError to ToolError so future auth/resource/prompt errors keep their stacks.
  2. Added a docstring note on the shared-client tradeoff in _inprocess.py. Reconnect guard would mask real WebSocket breakage; keeping fail-loud.
  3. Added symmetric websocket_manager.disconnect() in the finally block.
  4. Added a "not safe for concurrent use" note.
  5. Raised teardown failure log from INFO to WARNING with context that the shared client may be poisoned for the next story. Kept the broad except Exception so unexpected errors still surface rather than being hidden behind a narrow ToolError-only catch.
  6. Added tests/src/unit/test_tool_validation_log_filter.py mirroring the StatelessSessionLogFilter test pattern (6 cases: pydantic demotion, ToolError demotion, bare Exception passthrough, non-ToolError FastMCPError subclass passthrough, wrong logger, no exc_info).
  7. Hoisted fastmcp.exceptions and pydantic imports to module scope.
  8. Rewrote the _inprocess.py docstring to explain the _settings = None and websocket_manager.disconnect() semantics explicitly.

Not addressing:

  1. exc_info nulling is intentional. The structured error info (pydantic .errors() or ToolError message) lands in record.msg, and the filter's whole purpose is to replace the fastmcp/pydantic-internal stack with a single WARNING line. Validation and tool errors are user-input problems, not server bugs that warrant a stack for Sentry.
  2. Not splitting the log filter into its own commit; instead retitled the PR so release-note automation sees the correct scope.

@sergeykad sergeykad enabled auto-merge (squash) April 24, 2026 19:59
@kingpanther13 (Member) left a comment

Thanks for the quick turnaround. All the code items land cleanly, the new test_tool_validation_log_filter.py is exactly the right shape (the FutureAuthError case nicely locks in that non-ToolError subclasses keep their stacks), and the rewritten _inprocess.py docstring is a lot clearer.

On the exc_info nulling — fair call. With the filter narrowed to ToolError / pydantic ValidationError and the structured detail folded into record.msg, these really are user-input errors rather than bugs needing a Sentry-grade stack. The docstring now makes that intent explicit, which is what I was really after.

LGTM.

@sergeykad sergeykad merged commit 7c4836b into master Apr 24, 2026
19 checks passed
@sergeykad sergeykad deleted the fix/uat-runner-ergonomics branch April 24, 2026 20:04
@github-actions Contributor

🧪 Your changes are now in the dev channel!

Your PR has been merged to master and is available for testing in the dev channel.

Test your changes before the next stable release (biweekly Wednesday):
📖 Dev Channel Documentation

Quick start

# Run dev version
uvx ha-mcp-dev

# Check version
uvx ha-mcp-dev --version

Docker:

docker pull ghcr.io/homeassistant-ai/ha-mcp:dev
docker run --rm -i \
  -e HOMEASSISTANT_URL=http://your-ha:8123 \
  -e HOMEASSISTANT_TOKEN=your_token \
  ghcr.io/homeassistant-ai/ha-mcp:dev

Found an issue? Please open a new bug report and mention this PR for context.
